A video can be mapped into a multidimensional signal in a non-Euclidean space, in a way
that translates the more predictable passages of the video into linear sections of the signal.
These linear sections can be filtered out by techniques similar to those used for simplifying
planar curves. Different degrees of simplification can be selected. We have refined such
a technique so that it can make use of probabilistic distances between statistical image
models of the video frames. These models are obtained by applying hidden Markov
model techniques to random walks across the images. Using our techniques, a viewer can
browse a video at the level of summarization that suits his patience level. Applications
1 Motivation


People joke that a video tape is a “Write-Only Memory” (WOM). Indeed, in many
homes, hours of TV programs and family memories get videotaped and pile up, yet very
little is ever viewed again. One of the reasons is that, with only fast-forward viewing
as a browsing tool, it is so painfully inefficient and time-consuming to review previously
recorded material or search for specific footage that it is not worth the bother. Similarly,
thousands of hours of video data are becoming available online, but there is no way
to quickly preview this material before committing to a complete and often lengthy
download.

However, the example of text search on the Web demonstrates that even imperfect
search tools can be very useful and successful. These tools attempt to rank the relevance
of the search results, so that the user can focus his attention initially on material that
has a higher probability of being relevant to his query.

This paper describes our approach to applying this insight to video data. We propose
to summarize videos by a method that ranks frames by relevance. The proposed
mechanism will let the user say “I only have the patience to download and go over the
x% most useful frames of this video before I decide to download the whole video”.

We would like to select frames with the highest usefulness. At first, it seems that this
is hopeless, unless we can understand the semantic contents of frames and videos. For
example, there might be a shot that scans over books on a shelf and stops at the title of
a book that is important for understanding the story.

However, in many cases there are syntactic clues, provided by techniques that the
cameraman may use to convey the importance of the shot to the story. In many cases
the camera motion corresponds to the motion of the eyes of a surprised viewer. The
surprised viewer’s gaze is attracted to a strange part of the scene, the gaze scans the
scene to “zero in” on it, zooms in on it, and dwells on it for a while, until the new
information has “sunk in”. These changes in the image stream can be detected without
understanding the content of the stream.

In this connection, predictability is an important concept. Frames that are predictable
are not as useful as frames that are unpredictable. We can rank predictable frames lower,
since the viewer can infer them from context. Frames of a new shot cannot generally be
predicted from a previous shot, so they are important. (Cuts and transitions in image
streams have similarities to image edges.) On the other hand, camera translations and
pans that do not reveal new objects produce frames that are predictable.

We would like to detect when the camera stops (the viewer’s gaze stopping on a
surprising object). Note that what is unpredictable in this case is the camera motion,
not the image content. As the camera slows down, the image content stops changing, so
it is quite predictable. Therefore, we can consider frames in which the motion field changes
as more relevant than frames in which it does not.

We turn to a signal-theoretic view of video summarization. We can assume that the
original image stream signal has tens of thousands of dimensions (color components of
each pixel). We apply two filtering operations. The first operation can take the form of a
dimension reduction that finds a feature vector for each frame and transforms the image
stream into a feature vector trajectory, a signal that has many fewer dimensions than the
original signal (e.g., 37 in one of the methods described below). Alternatively, we can
represent each frame by a statistical model that captures average characteristics of colors,
and possibly texture and motion, as well as contiguity properties. In this method too,
we can view the original image stream as being filtered into a new signal in some
(non-Euclidean) space, where we define the distance between frames as the distance between
their statistical models. Both of these methods are described in the next section.

As a result of the first filtering step, we would like the output signal to be a straight
line, or to remain in approximately the same position in the space, when nothing of
interest happens, and to have a detectable curvature, a step or a roof, when a noteworthy
event takes place.

Noise in this context is not the same as pixel noise. The image stream generated by
a fixed camera looking from a window at a crowd milling around in the street may be
considered to have a stationary component and a visual noise component, due to the
changing colors of people’s clothes. The passing of a fire truck would be an example of
a signal over this fluctuating but monotonous background.

We apply a second filtering step, with the goal of detecting regions of high curvature
along the trajectory, and we rank the filtering results. Since we expect the video signal
to be noisy in the sense described above, we need the second filtering step to enhance the
linear parts as well as the parts with significant curvature. In the non-Euclidean feature
space of statistical frame models, projections cannot be computed; thus this filtering
step should use only distance measures between models. It should allow for hierarchical
output, so that the user can specify the level of detail (or scale) at which he wants to
view the frames that show noteworthy events.

We can attempt to optimize both the first filtering step (mapping to the feature vector
trajectory) and the second filtering step (edge-roof detection).

2  Mapping  an  Image  Stream  into  a  Trajectory 

We begin with our proposal for the first filtering step, motivated in the previous section.
We present two mappings of an image stream into a trajectory such that the trajectory
is highly bent when events of interest occur in the stream. We assign a point on the
trajectory to each frame in the stream.

For the first mapping, we define four histogram buckets of equal size for each of
the three color attributes in the YUV color space of MPEG encoding. Each bucket
contributes three feature vector components: the pixel count, and the x and y coordinates
of the centroid of the pixels in the bucket. This yields 36 components, and we add the
frame number (time) to obtain 37 components. Thus, the trajectory in this case is a
polygonal arc in $\mathbb{R}^{37}$. (We are investigating an alternate scheme in which the number
and sizes of the buckets are selected according to the color distribution over the video
sequence.)
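
The 37-component vector described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `frame_feature_vector` is hypothetical, the frame is assumed to be an 8-bit array with Y, U and V planes at the same resolution, and an empty bucket is assumed to get a centroid of (0, 0), a case the text does not discuss.

```python
import numpy as np

def frame_feature_vector(yuv, frame_number, buckets=4):
    """Build the 3 * buckets * 3 + 1 feature vector for one frame.

    yuv: H x W x 3 uint8 array (Y, U, V at equal resolution; an assumption).
    """
    h, w, _ = yuv.shape
    ys, xs = np.mgrid[0:h, 0:w]                     # pixel coordinates
    # Equal-size buckets over the 8-bit range: bucket index in 0..buckets-1.
    idx = (yuv.astype(np.int64) * buckets) // 256
    features = []
    for channel in range(3):
        for b in range(buckets):
            mask = idx[:, :, channel] == b
            count = int(mask.sum())
            if count:
                cx, cy = xs[mask].mean(), ys[mask].mean()
            else:
                cx, cy = 0.0, 0.0                   # assumption for empty buckets
            features.extend([count, cx, cy])        # pixel count and centroid
    features.append(frame_number)                   # time is the last component
    return np.asarray(features, dtype=float)
```

With the default of four buckets the result has 3 x 4 x 3 + 1 = 37 components, one point of the trajectory per frame.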

When the camera translates or pans smoothly without seeing new things, the centroid
components change linearly and the feature vector trajectory is linear. If the camera
suddenly decelerates (or accelerates), the trajectory has a high curvature, because the
centroids decelerate.

As an alternative mapping, we generate a statistical model for each frame using a
hidden Markov model (HMM) technique. We obtain sequences of pixels by taking random
walks across each frame, moving either to the north, east, south or west neighbor with
equal probability. When we hit the border of the image, we jump randomly to another
pixel of the image, and start another sequence. We model our observations by counting
the number of times we step to a pixel of a certain color, given that we come from a
neighbor of the same color or of a different color (colors are the same if they are quantized
to the same value; the quantization method is described below). Numbers representing
the counts of transitions from one color to another color can be stored in a 2D table.
Note that this table is a cooccurrence matrix [6] for the quantized colors, except for the
fact that some pixels may be visited twice and other pixels may be missed.

To make this information independent of the number of steps taken, we can normalize
each row so that the row numbers sum to 1. For large numbers of observations, this table
is a color transition probability matrix, as it describes the probability of arriving at a
certain color at the next step, given that we are at a certain color at the present step. In
addition, we keep track of the values of the pixels at the first step of each new walk to
compute a histogram of the colors of the image.
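
The counting and row normalization described above can be sketched as follows. The function name `random_walk_statistics` and its parameters are hypothetical, the image is assumed to be already quantized into integer color labels, and "hitting the border" is interpreted here as a step that would leave the image.

```python
import numpy as np

def random_walk_statistics(labels, n_colors, n_steps, seed=0):
    """Estimate a color transition matrix and a start-color histogram.

    labels: 2D integer array of quantized colors in range(n_colors).
    """
    rng = np.random.default_rng(seed)
    h, w = labels.shape
    counts = np.zeros((n_colors, n_colors))
    start_hist = np.zeros(n_colors)
    moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]      # north, east, south, west

    def new_start():
        # Jump to a random pixel and record its color as a walk start.
        r, c = int(rng.integers(h)), int(rng.integers(w))
        start_hist[labels[r, c]] += 1
        return r, c

    r, c = new_start()
    for _ in range(n_steps):
        dr, dc = moves[int(rng.integers(4))]        # four neighbors, equal probability
        nr, nc = r + dr, c + dc
        if not (0 <= nr < h and 0 <= nc < w):
            r, c = new_start()                      # left the image: start a new walk
            continue
        counts[labels[r, c], labels[nr, nc]] += 1   # tally the color transition
        r, c = nr, nc

    # Normalize each row to sum to 1; rows never visited stay all zero.
    row_sums = counts.sum(axis=1, keepdims=True)
    transition = np.divide(counts, row_sums,
                           out=np.zeros_like(counts), where=row_sums > 0)
    histogram = start_hist / start_hist.sum()
    return transition, histogram
```

For an image split into two uniform halves, most steps stay within one color, so the diagonal entries of the returned matrix dominate.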

To avoid excessive model size, the colors must be quantized. Using HMM terminology,
this operation can be called a state assignment of the pixels, since we are saying that
when a color is in a certain interval, the pixel belongs to a given bin, or state. After
quantization, we can describe the image by a histogram of the states and a state transition
matrix. To compensate for the reduced descriptive power of a statistical model using
fewer states, the HMM describes the distribution of each color within each bin/state. In
our experiments, we modeled the color distribution within each state by three Gaussians,
i.e. a total of six numbers. HMM techniques (described in the next paragraph) allow
us to compute a quantization of the color space such that in each bin/state, the color
distributions are well represented by Gaussians. The labeling of pixels into states is
hidden, in the sense that only actual pixel values are observed, not their quantized
values, and a computation assigns the best states as follows.

A state assignment is obtained in two steps in an iteration loop. In the first step, we
compute the sequences of states that have the highest probabilities, given the observation
sequences along the random walks. We obtain these maximum probabilities and the
corresponding state sequences by a dynamic programming technique called the Viterbi
algorithm [12], using the state transition matrix and the probability distribution of
observations within each state (obtained at a previous iteration). In the second step, now that
each pixel has been labeled with a specific state, we can recompute the most likely state
transition matrix by tallying the transitions from state to state along the random walks.
Also, we can recompute the most likely Gaussian probability distributions of observations
within each state by finding the means and variances of the colors of the pixels labeled
with that state. These two steps are repeated alternately until there is no significant
improvement. This is the so-called segmental K-means approach to computing a hidden
Markov model from sequences of observations [12]. (The slower-converging Baum-Welch
algorithm can be used instead with similar results.) The resulting statistical description
of the image consists of a state transition matrix, which is essentially a cooccurrence
matrix for quantized colors, together with a description of the color distributions within
each bin, and the probability distribution of the states of the starting pixels of each
random walk, which is a histogram of the quantized colors of the image.
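
The two-step loop above can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: observations are assumed to be scalar with a single Gaussian per state (rather than three Gaussians per state), the names `viterbi` and `segmental_kmeans` are hypothetical, and the loop is assumed to stop when the state labeling no longer changes.

```python
import numpy as np

def log_gauss(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, log_pi, log_A, means, variances):
    """Most probable state sequence for one scalar observation sequence."""
    T = len(obs)
    log_b = log_gauss(obs[:, None], means[None, :], variances[None, :])  # T x S
    delta = log_pi + log_b[0]
    back = np.zeros((T, len(log_pi)), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: best path to i, then i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_b[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):            # backtrack along stored predecessors
        path[t - 1] = back[t, path[t]]
    return path

def segmental_kmeans(sequences, means, variances, max_iter=20, var_floor=1e-3):
    sequences = [np.asarray(s, dtype=float) for s in sequences]
    means = np.asarray(means, dtype=float).copy()
    variances = np.asarray(variances, dtype=float).copy()
    S = len(means)
    log_pi = np.full(S, -np.log(S))          # start from uniform probabilities
    log_A = np.full((S, S), -np.log(S))
    paths = None
    for _ in range(max_iter):
        # Step 1: label every observation with its most probable state.
        new_paths = [viterbi(s, log_pi, log_A, means, variances) for s in sequences]
        if paths is not None and all(np.array_equal(a, b)
                                     for a, b in zip(paths, new_paths)):
            break                            # labeling is stable: stop
        paths = new_paths
        # Step 2: re-estimate the model by tallying the labeled data.
        counts = np.full((S, S), 1e-6)       # tiny prior avoids log(0)
        starts = np.full(S, 1e-6)
        for p in paths:
            starts[p[0]] += 1
            np.add.at(counts, (p[:-1], p[1:]), 1)
        log_A = np.log(counts / counts.sum(axis=1, keepdims=True))
        log_pi = np.log(starts / starts.sum())
        x, s = np.concatenate(sequences), np.concatenate(paths)
        for k in range(S):
            if np.any(s == k):               # states without data keep old values
                means[k] = x[s == k].mean()
                variances[k] = max(x[s == k].var(), var_floor)
    return paths, np.exp(log_pi), np.exp(log_A), means, variances
```

The variance floor and the small count prior are numerical safeguards added for this sketch; they are not part of the description above.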

Once we have obtained HMM models for the video frames, we are able to compute
distances between frames. The idea behind a distance calculation between two images
using HMMs is to find how well the HMM of one image can model the other (and vice
versa), in comparison with how well each HMM can model the image on which it was
trained. To measure the modeling ability of an HMM for any image, we can obtain an
observation sequence from that image by a random walk, and compute the probability
that this sequence could be produced by the HMM. When images are visually similar,
this probability is high.

In other words, a distance measure between two images $I_1$ and $I_2$ with HMM models
$\lambda_1$ and $\lambda_2$ is constructed by combining the probability that the observation sequences
$O_2$ obtained by random walks through image $I_2$ could be produced by the probabilistic
model $\lambda_1$ of image $I_1$, and a similar probability where the roles of $I_2$ and $I_1$ are reversed.
This quantity is normalized by the number of observations, and compared to how well
the HMMs can model the images on which they were trained:

$$d(I_1, I_2) = -\frac{1}{N_2} \log P(O_2 | \lambda_1) - \frac{1}{N_1} \log P(O_1 | \lambda_2) + \frac{1}{N_1} \log P(O_1 | \lambda_1) + \frac{1}{N_2} \log P(O_2 | \lambda_2) \tag{1}$$

where $N_1$ and $N_2$ denote the numbers of observations in $O_1$ and $O_2$.
Quantities of the form $\log P(O_i | \lambda_j)$ are computed by applying the classic Forward
Algorithm [12] to the observation sequences using the transition matrix and probability
distributions prescribed by the HMM model $\lambda_j$.
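
As a hedged illustration of Eq. (1), the sketch below evaluates log-likelihoods with a log-domain forward algorithm and combines them into the distance. The names `log_likelihood` and `hmm_distance` are hypothetical, a model is assumed to be a tuple (start probabilities, transition matrix, state means, state variances) with scalar Gaussian observations, and each term is assumed to be normalized by the length of its own observation sequence.

```python
import numpy as np

def log_likelihood(obs, model):
    """log P(O | model) by the forward algorithm, computed in the log domain."""
    pi, A, means, variances = (np.asarray(m, dtype=float) for m in model)
    obs = np.asarray(obs, dtype=float)
    # log_b[t, s]: log density of observation t under the Gaussian of state s.
    log_b = -0.5 * (np.log(2 * np.pi * variances)
                    + (obs[:, None] - means) ** 2 / variances)
    log_A = np.log(A)                 # assumes no zero transition probabilities
    alpha = np.log(pi) + log_b[0]
    for t in range(1, len(obs)):
        # Sum over predecessor states with log-sum-exp for numerical stability.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_b[t]
    return np.logaddexp.reduce(alpha)

def hmm_distance(obs1, model1, obs2, model2):
    """Symmetric dissimilarity: cross-model fit compared with self-model fit."""
    n1, n2 = len(obs1), len(obs2)
    return (-log_likelihood(obs2, model1) / n2
            - log_likelihood(obs1, model2) / n1
            + log_likelihood(obs1, model1) / n1
            + log_likelihood(obs2, model2) / n2)
```

Calling `hmm_distance` with the same image and model in both roles gives zero, and swapping the two arguments pairs leaves the value unchanged, which matches the properties discussed next.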

This distance function defines a semi-metric space, because it satisfies

positivity: $d(x, x) = 0$ and $d(x, y) > 0$ if $x$ is distinct from $y$,

symmetry: $d(x, y) = d(y, x)$,

but not the triangle inequality, i.e., there can exist $z$'s such that

$$d(x, y) > d(x, z) + d(z, y).$$

For this mapping, the trajectory describing a sequence of video frames is also a polygonal
arc (in the sense that it is a finite, linearly ordered sequence of points) but it is not
contained in Euclidean space; it is contained in a non-linear semi-metric space. This
means that the points on the trajectory cannot be assigned coordinates, and we can only
measure a semi-distance between any two points.

Distances based on image statistics (histogram, co-occurrence, HMM) are quite
insensitive to image translation, and therefore produce points that are in approximately
the same positions in space when the camera motion is a pan or a translation.

3 Trajectory Filtering by Polygon Simplification

Our  first  filtering  operation,  described  in  the  previous  section,  maps  a  video  sequence 
into  a  trajectory  that  is  a  polygonal  arc,  i.e.,  a  polyline.  The  polyline  may  be  noisy,  in 
the  sense  that  it  is  not  linear  but  only  nearly  linear  for  video  stream  segments  where 
nothing  of  interest  happens  (i.e.,  where  the  segments  are  predictable).  Furthermore,  the 
parts  of  high  curvature  are  difficult  to  detect  locally.  Therefore,  it  is  necessary  to  apply 
a  second  filtering  operation,  which  we  describe  in  this  section. 

The goal is to simplify the polyline so that its sections become linear when the corresponding
video stream segments are predictable, which also means that the vertices of
the  simplified  polyline  are  key  frames  of  the  non-predictable  video  stream  segments.  We 
achieve  this  by  repeated  removal  of  the  vertices  that  represent  the  most  predictable  video 
frames.  In  terms  of  the  geometry  of  the  polyline  trajectory,  these  vertices  are  the  most 
linear  ones.  While  it  is  clear  what  “linear”  means  in  a  linear  space,  we  need  to  define 
this  concept  for  semi-metric  non-linear  spaces. 

A  polyline  is  an  ordered  sequence  of  points.  Observe  that  even  if  the  polyline  is 
contained  in  Euclidean  space,  it  is  not  possible  to  use  standard  approximation  techniques 
like  least-square  fitting  for  its  simplification,  since  the  simplified  polyline  would  then 
contain  vertices  that  do  not  belong  to  the  input  polyline.  For  such  vertices,  there  would 
not  exist  corresponding  video  frames.  Thus,  a  necessary  condition  for  a  simplification  of  a 
video  polyline  is  that  the  sequence  of  vertices  of  the  simplified  polyline  be  a  subsequence 
of  the  original  one. 

Our  approach  to  simplification  of  video  polylines  is  based  on  a  novel  process  of  discrete 
curve  evolution  presented  in  [9]  and  applied  in  the  context  of  shape  similarity  of  planar 
objects  in  [11].  However,  here  we  will  use  a  different  measure  of  the  relevance  of  vertices, 
described  below. 

Aside  from  its  simplicity,  the  process  of  discrete  curve  evolution  differs  from  the 
standard  methods  of  polygonal  approximation,  like  least  square  fitting,  by  the  fact  that 
it  can  be  used  in  non-linear  spaces.  The  only  requirement  for  discrete  curve  evolution 
is  that  every  pair  of  points  is  assigned  a  real-valued  distance  measure  that  does  not 
even  need  to  satisfy  the  triangle  inequality.  Clearly,  this  requirement  is  satisfied  by  our 
distance  measure,  which  is  a  dissimilarity  measure  between  images. 

Now  we  briefly  describe  the  process  of  discrete  curve  evolution  (for  more  details  see 
[10]).  The  basic  idea  of  the  proposed  evolution  of  polygons  is  very  simple: 

•  At  each  evolution  step,  the  vertex  with  smallest  relevance  is  detected  and  deleted. 

The key property of this evolution process is the order of the deletion, which is given by
a relevance measure K that is computed for every vertex v and depends on v and its two
neighbor vertices u, w:

$$K(v) = K(u, v, w) = d(u, v) + d(v, w) - d(u, w) \tag{2}$$

where d is the semi-distance function. Intuitively, the relevance K(v) reflects the shape
contribution of vertex v to the polyline.

Figure 1: Fish silhouette with 124 vertices (a) and a simplified curve with 21 points (b)

Figure 2: Video trajectory (a) and curve simplification producing 20 relevant frames
(black dots) for “Mr. Bean’s Christmas”, using (b) the feature vector method, and (c)
the HMM method.

Fig. 1 illustrates the curve simplification produced by the proposed filtering technique
for a planar figure. Notice that the most relevant vertices of the curve and the general
shape of the figure are preserved even after most of the vertices have been removed.

We will demonstrate with the experimental results in the next section that the discrete
curve evolution based on this relevance measure is very suitable for filtering polylines
representing videos.
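
The evolution described in this section can be sketched as follows, using only a pairwise distance function and the relevance K(v) = d(u, v) + d(v, w) - d(u, w). This is a minimal sketch under assumptions: the function name `curve_evolution` is hypothetical, the two endpoints are assumed never to be removed, and a binary heap with lazy invalidation of outdated entries is one possible way to find the least relevant vertex quickly.

```python
import heapq
import math

def curve_evolution(points, dist, keep):
    """Return the indices of the vertices that survive until `keep` remain."""
    n = len(points)
    prev = list(range(-1, n - 1))        # doubly linked list over vertex indices
    nxt = list(range(1, n + 1))
    alive = [True] * n
    version = [0] * n                    # bumped whenever a relevance changes

    def relevance(i):
        u, v, w = points[prev[i]], points[i], points[nxt[i]]
        return dist(u, v) + dist(v, w) - dist(u, w)

    heap = [(relevance(i), i, 0) for i in range(1, n - 1)]
    heapq.heapify(heap)
    remaining = n
    while remaining > max(keep, 2) and heap:
        _, i, ver = heapq.heappop(heap)
        if not alive[i] or ver != version[i]:
            continue                     # outdated heap entry, skip it
        alive[i] = False                 # delete the least relevant vertex
        remaining -= 1
        p, q = prev[i], nxt[i]
        nxt[p], prev[q] = q, p           # link the two neighbors together
        for j in (p, q):                 # their relevances have changed
            if 0 < j < n - 1:
                version[j] += 1
                heapq.heappush(heap, (relevance(j), j, version[j]))
    return [i for i in range(n) if alive[i]]

# Usage with planar points and Euclidean distance; any symmetric
# dissimilarity, such as an HMM-based one, could be passed instead.
corner = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
kept = curve_evolution(corner, math.dist, keep=3)
```

In the usage example the collinear vertices have relevance zero and are removed first, so the surviving vertices are the two endpoints and the corner. The surviving indices form a subsequence of the input, so every kept vertex corresponds to an actual frame.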

4  Experimental  Results 

We illustrate our techniques using an 80-second clip from a video entitled “Mr. Bean’s
Christmas”. The clip contains 2379 frames. First, we applied the feature vector approach
described above, in which a 37-dimensional feature vector derived from centroids and
pixel counts in histogram bins is computed for each frame. A perspective view of the
3D projection of the video trajectory is shown in Fig. 2a. The two large black dots are
the points corresponding to the first and last frames of the video. Curve simplification
using the method described in Section 3 was then applied to this trajectory. Fig. 2b
shows a simplification result in which only 20 points have been preserved. A method for
automatic selection of the smallest point count that can still provide an appropriate level
of summarization is presented at the end of the next section.

Finally, HMM models were computed for every five frames, and curve simplification
was performed using the probabilistic distance measure described in Section 2. For
comparison with the feature vector method, we chose to also preserve 20 key frames with
the HMM curve simplification. Since this method does not provide frame coordinates,
we plotted the 20 points that correspond to these 20 frames along the trajectory found
by the feature vector approach, in order to give an idea of the locations of the 20 frames
in the video (Fig. 2c). Clearly, the HMM method located its key frames on segments of
sudden change of that trajectory, i.e. in regions of significant change in the video clip.

Next we discuss the quality of the summaries produced by curve simplification using
the feature vector and HMM methods. Knowing the content of each shot of the clip is
helpful for this comparative evaluation.

1. Frames 1 to 996: Mr. Bean carries a raw turkey from a kitchen counter to a table.
He cuts a string that tied the legs, brings a bowl of stuffing closer and starts pushing
stuffing inside the turkey.

2. Frames 997 to 1165: He notices that his watch is missing.

3. Frames 1166 to 1290: He looks inside the turkey, then pulls stuffing out of the
turkey to retrieve his watch.

4. Frames 1291 to 1356: He keeps removing stuffing.

5. Frames 1357 to 2008: He tries to look inside, then uses a flashlight to try to locate
his watch inside the turkey. Finally, he bends toward the turkey to explore more
deeply by putting his head inside the turkey.

6. Frames 2009 to 2079: The lady friend whom he invited for Christmas dinner rings
his doorbell.

7. Frames 2080 to 2182: Hearing the bell, Mr. Bean stands up with his head stuck
inside the turkey. He vainly attempts to remove the turkey.

8.  Frames  2183  to  2363:  He  walks  blindly  toward  the  door  with  the  turkey  over  his 
head,  bumping  into  things. 

9.  Frames  2364  to  2379:  His  lady  friend  waits  outside  for  the  door  to  open... 

Fig. 3 shows two storyboards obtained by curve simplification. Storyboard (a) results
from the curve simplification obtained by the feature vector method. The frames
correspond to the vertices of the simplified polyline in Fig. 2b. Storyboard (b) results from
the HMM method. The frames correspond to the vertices of the simplified polyline in
Fig. 2c.

Both storyboards seem to be reasonable summarizations of the short video clip. The
feature vector method misses Shot 9 (the last shot), and oversamples Shot 2 (where he
notices that his watch is missing) with 8 frames. The cause of this oversampling was
traced to the histogram quantization of the feature vector method (Section 2). Pixels of
colors located half-way between the representative colors of histogram bins could switch
their assignments from one bin to the next with a small variation of color between
successive frames. In rare instances, a relatively large number of pixels may flip back and
forth between two neighboring bins, causing large jumps in the feature vector components
which result in large meaningless peaks in the polygonal line representing the video. The
HMM method is not affected by such quantization artifacts; it represents each shot with
at least one frame, and selects frames that tell more of the story, such as the string
cutting of frame 381 and the flashlight episode of frame 1771. A more thorough evaluation
could be obtained by comparison with ground-truth provided by humans who view the
clips and select small numbers of frames as most descriptive of the stories.

5  A  Video  Player  with  Smart  Fast-Forwarding 

An interesting application of video summarization is to the design of a smart VCR
fast-forwarding that samples only the most relevant frames. We have developed a Java video
player that plays video clips in MPEG format, and can play the whole video at the normal
rate, or show only the frames of highest relevance (Fig. 4a). A vertical relevance slider
on the right-hand side of the window lets the user define the number of frames that he
has the patience to watch. For example, the “Mr. Bean’s Christmas” video clip contains
2379 frames, so that playing it takes around 80 seconds. The user may choose to watch
only the 20 most relevant frames and move the slider up until the box at the left of the
sliding elevator indicates 20. Then the player skips all but the 20 most relevant frames.
The buttons at the bottom of the player window define VCR-type functions: Play, Pause,
Fast-Forward, Fast-Backward, frame-by-frame forward stepping, and backward stepping.
In all these modes, only the relevant frames, as defined by the relevance slider, are played.

A  horizontal  sampling  stripe  located  under  the  image  display  panel  shows  the  positions 
of  the  relevant  frames  within  the  video.  It  is  a  black  stripe  that  shows  a  white  vertical 
tick  mark  for  each  displayed  frame.  A  triangular  frame  marker  slides  below  the  sampling 
stripe  as  the  video  clip  is  being  played,  and  indicates  which  frame  is  being  displayed. 
Navigation  through  the  video  can  also  be  performed  by  dragging  this  triangular  frame 
marker.  This  mode  of  navigation  is  called  “scrubbing”  by  video  editing  practitioners.  It 
is  set  to  let  the  user  visit  all  the  frames,  not  just  the  relevant  frames. 

Video clips are selected from a pop-up menu. The user can also select different types
of relevance measures from a second pop-up menu. The relevances presently available
in the video player have been precomputed from Euclidean distances between feature
vectors from histogram bins and from HMM distances, both described above, as well as
from Euclidean distances between motion feature vectors, described in [16]. We plan to
add other filtering choices, such as relevances based on the presence of faces, music and
talk content in the sound track.

When a new video is selected for viewing, the vertical relevance slider is initially
positioned at a default position which shows only a small number of relevant frames. This
number is precomputed for each available type of relevance measure using a histogram
slope technique. Cumulative histograms that represent, for any given relevance, the
number of frames that have larger relevance, are found to have similar shapes (Fig. 4b):
most of the frames have small relevances; these frames have small variations with respect
to their neighbors. Very few frames have large relevances. The two regions are separated
by a sudden slope change. We wish to ignore the many frames with small relevances
and show the few frames with large relevances; therefore we select the cutoff relevance
at the slope break between the two regions, around slope -1. For the histogram of Mr.
Bean’s video, this corresponds to around 27 frames. After exploring the video at several
relevance slider positions, the user can return the slider to its default position by clicking
the button labeled “Reset Sampling”.
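
One possible reading of the histogram slope technique is sketched below. The text does not specify how the axes are scaled or how the slope is estimated, so everything here is an assumption: relevances are scaled by the largest relevance, the vertical axis is the proportion of frames, the slope is measured over a sliding window of `window` frames, and the name `default_frame_count` is hypothetical.

```python
import numpy as np

def default_frame_count(relevances, window=5):
    """Number of frames lying above the slope break of the cumulative histogram."""
    r = np.sort(np.asarray(relevances, dtype=float))[::-1]   # most relevant first
    n = len(r)
    if n <= window or r[0] <= 0:
        return n                       # too little data to locate a break
    x = r / r[0]                       # relevance scaled to at most 1 (assumption)
    y = np.arange(1, n + 1) / n        # proportion of frames at or above x
    for k in range(n - window):
        dx = x[k + window] - x[k]      # <= 0 because x is in descending order
        dy = y[k + window] - y[k]      # > 0
        # Walking from high to low relevance, stop where the curve
        # becomes at least as steep as slope -1.
        if dx == 0 or dy / dx <= -1.0:
            return max(k, 1)
    return n
```

On a relevance list with a handful of large values followed by many small ones, the function returns the size of the handful, which is the behavior the slope break is meant to capture.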

This prototype uses MPEG-decoding Java source code written by J. Anders [1].

6  Related  Work  and  Discussion 

Huang et al. [8] showed that using more descriptive statistical models of images such
as correlograms significantly improves retrieval performance of images, in comparison to
simple statistical descriptions such as histograms. However, they do not have a method
for selecting the right balance between the size of the correlogram and the discriminative
power of the model. Second, they apply Euclidean distances to their statistical models.
Consequently, they have to give different weights to the Euclidean coordinates depending
on the situation. In our view, the use of hidden Markov models of images elegantly
addresses these issues, by (1) supplementing coarse color quantization with a description
of color distributions within each bin, while automatically adjusting each bin size to make
this description optimal, and (2) addressing the distance issue by allowing an intuitive
probabilistic definition of image distance.

In [3], we described how the Ramer-Douglas-Peucker method of polygon simplification
[13] could provide effective summarizations of videos. This method is a binary
curve splitting approach that at each step splits the arc at the point furthest from the
chord, and stops when the arc is close to the chord. However, for N video frames it has
a time complexity that is prohibitive for large videos and complex distance measures.
Variants that reduce the complexity to N log N cannot be applied to multidimensional
video trajectories, as they make use of planar convex hulls [7]. In addition, the
computation of the distance between an arc and its chord requires the use of Euclidean
distances. The curve simplification technique we have proposed can be shown to be of
order N log N and can accommodate non-Euclidean distances. These two features make
the use of probabilistic image distances practical for video summarization.

The  reader  interested  in  video  browsing  research  can  refer  to  [14,  15,  17]  and  to  the 
recent  work  of  Foote  [5]. 

7  Conclusions  and  Future  Work 

In this work, we have proposed and implemented a system for automatically providing
summaries of videos whose size can be controlled by the user. The method applies a
novel fine-to-coarse polyline simplification technique that computes for each vertex a
relevance measure based on its two neighbors and at each step removes the least relevant
vertex and updates the relevances of the affected neighbors. The proposed relevance
measure is valid for non-metric spaces. This allows us to compute relevances using
a probabilistic distance measure between hidden Markov models of the video frames.
We produce reasonable summaries by showing the most relevant frames in temporal
order. We have implemented a video player that incorporates this technology to let
the user perform a smart fast-forwarding that skips the more predictable frames. A
vertical slider lets the user define the number of relevant frames he has the patience
to watch. We are currently investigating improved random sampling of images for the
HMM calculation using quasi-random walks [2], as well as summarization results for a
2D HMM technique [4]. We are also improving our video player to let the user select a
region of a frame and retrieve the frames that have the shortest HMM distances to that
region.

Figure 3: Storyboards obtained by curve simplification of the video trajectory obtained
by the feature vector method (a) and by the HMM method (b)

Figure 4: (a): Video player with vertical slider for control of summarization level, (b):
Cumulative histogram giving proportion of frames with relevances larger than a given
number, used to determine default summarization level.